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Abstract 

The use of regularization, or penalization, has become increasingly common in high- 
dimensional statistical analysis over the past decade, where a common goal is to simultaneously 
select important variables and estimate their effects. It has been shown by several authors that these 
goals can be achieved by minimizing some parameter-dependent "goodness of fit" function (e.g., 
a negative loglikelihood) subject to a penalization that promotes sparsity. Penalty functions that 
are nonsmooth (i.e. not differentiable) at the origin have received substantial attention, arguably 
beginning with LASSO dTibshiranil [19961) . 

The current literature tends to focus on specific combinations of smooth data fidelity (i.e., 
goodness-of-fit) and nonsmooth penalty functions. One result of this combined specificity has been 
a proliferation in the number of computational algorithms designed to solve fairly narrow classes of 
optimization problems involving objective functions that are not everywhere continuously differen- 
tiable. In this paper, we propose a general class of algorithms for optimizing an extensive variety of 
nonsmoothly penalized objective functions that satisfy certain regularity conditions. The proposed 
framework utilizes the majorization-minimization (MM) algorithm as its core optimization engine. 
The resulting algorithms rely on iterated soft-thresholding, implemented componentwise, allowing 
for fast, stable updating that avoids the need for any high-dimensional matrix inversion. We establish 
a local convergence theory for this class of algorithms under weaker assumptions than previously 
considered in the statistical literature. We also demonstrate the exceptional effectiveness of new ac- 
celeration methods, originally proposed for the EM algorithm, in this class of problems. Simulation 
results and a microarray data example are provided to demonstrate the algorithm's capabilities and 
versatility. 

Keywords: Iterative Soft Thresholding, MIST, MM algorithm. 

1 Introduction 



Variable selection is an important and challenging issue in the rapidly growing realm of high- 
dimensional statistical modeling. In such cases, it is often of interest to identify a few important variables 
in a veritable sea of noise. Modern methods, increasingly based on the principle of penalized likelihood 
estimation applied to high dimensional regression problems, attempt to achieve this goal through an 
adaptive variable selection process that simultaneously permits estimation of regression effects. Indeed, 
the literature on the penalization of a "goodness of fit" function (e.g., negative loglikelihood), with a 
penalty singular at the origin, is quickly becoming vast, proliferating in part due to the consideration of 

'Department of Statistical Science, 301 Malott Hall, Cornell University, Ithaca NY 14853 USA 



1 



specific combinations of data fidelity (i.e., goodness-of-fit) and penalty functions, the associated statisti- 
cal properties of r esulting estimators, and the development of several combination-specific optimization 
algor i thms, (e.g.,|j ibshirani , ll996l:EoulE 006: IZou and Hastie . 2005; Zou and Zhang . 2009; Fan and Lii 
l200ll : IPark and HastieL 120071 : iFriedman et all l2008h~ 

In this paper, we propose a unified opti mization framework that appeals to the Majorization- 
Minimization (MM) algorithm (ILangeL 12004) as the primary optimization tool. The resulting class 
of algorithms is referred to as MIST, an acronym for Minimization by Iterative Soft Thresholding. The 
MM algorithm has been co nsidered before for solving specific clas s es of singularly penalized likelihood 
estimation problems (e.g.. iDaubechies et all 2004; H unter and LiL 2005; Z ou and Li , l2008h : to a large 
extent, this work is motivated by these ideas. A distinct advantage of the proposed work is the excep- 
tional versatility of the class of MIST algorithms, their associated ease of implementation and numerical 
stability, and the development of a fixed point convergence theory that permits weaker assumptions than 
existing papers in this area. We emphasize here the focus of this paper is on the development of a stable 
and versatile class of algorithms applicable to a wide variety of singularly penalized estimation prob- 
lems. In particular, the consideration of asymptotic and oracle properties of estimators derived from 
particular combinations of fidelity and penalty functions, as well as methods for effectively choosing 
associated penalty para meters, are not focal points of this paper. A comprehensive treatment of these 
results may be found in Uohnson et al.l (|2008l ). where asymptotics and oracle properties for estimators 
derived from a general class of penalized estimating equations are developed in some detail. 

The paper is organized as follows. In Section 2, we introduce notation and provide sufficient condi- 
tions for local convergence of the MM algorithm applied to a large class of data-fidelity and non-smooth 
penalty functions. In Section 3, we present a specialized version of this general algorithm, demonstrat- 
ing in particular how the minimization step of the MM algorithm can be carried out using iterated 
soft-thresholding. In its most general form, iterated soft-thresholding is required at each minimization 
step; we further demonstrate how to carry out this minimization step in one iteration through a judicious 
choice of majorization function. As a consequence, we present a simplified class of iterative algorithms 
that are applicable to a wide class of singularly penalized, generalized linear regression models. 

Simulation results are provided in Secti on 4, while an applic ation in survival analysis to Diffuse 
Large B Cell Lymphoma expression data (|Rosenwald et all. 120021) is presented in Section 5. We con- 
clude with a discussion in Section 6. Proofs and other relevant results are collected in the Appendix. 



2 MM algorithms for nonsmooth objective functions 

Let i;(JS) denote a real- valued objective function to be minimized for j8 = (fix, . . . ,{Sp) T in some convex 
subset S of W. Let g SUR (fi,a) denote a real- valued "surrogate" objective function, where a e < B. 
Define the minimization map 



M(a) = argmin f UR (fi,a). 



(1) 



Then, if £ s UR {fi, a) majorizes £(/? ) for each a, a generic MM algorithm for minimizing £(/?) takes the 
following form (e.g.. ILangeL 120041) : 



1. Initialize /? (0) 

2. For n > 0, compute - M(Jr n >), iterating until convergence. 
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Provided that the objective function, its surrogate and the mapping M(-) satisfy ce rtain regularit y condi- 
tions, one can establish convergence of this algorithm to a local or global solution. iLangd (120041 . Ch. 10) 
develops such a theory assuming that the objective functions £(/?) and g s UR (fi, a) are twice continuously 
differentiable. For problems th at lack this degree of smoothness (e.g., all sin gularly penaliz ed regres- 
sion p roblems, including lasso, Tibshirani (1996); adaptive lasso, Zou ( 2006b ; and SCAD, Fan and Li 
(1200 11) ). a more general theory of local convergence is required. One such theor y is summariz ed in 
AppendixlA.lt related re sults for the EM algorithm may be found in IWul (119831) . iTsengl ((2004) and 
Chretien and Herol d2008h . 

Let || • || denote the usual Euclidean vector norm. Based on the theory summarized in Appendix I A. 1[ 
we propose a new and general class of algorithms for minimizing penalized objective functions of the 
form 



m = gifi) + Pifi; A) + As\\fi\\ 2 , /I > 0, £ > 



(2) 



where gifi) and p(J3; A) are respectively data fidelity (e.g., negative loglikelihood) and penalty functions 
that satisfy regularity conditions to be delineated below. As will be shown later, the class of problems 
represented by © contains all of the penalized regression problems commonly considered in the current 
literature. It also covers numerous other problems by expanding the class of permissible fidelity and 
penalty functions in a substantial way. 

We assume throughout that gifi) is convex and coercive for yS € S, where S is an open convex subset 
of K. p . We further assume that 



p(fi;A) = Y j pQPj\;A j ), 



(3) 



where the vector A - (A\, . . . , A T p ) T and Aj denotes the block of A associated with fij. It is assumed that 
each Aj has dimension greater than or equal to one, that all blocks have the same dimension, and that 
the Aji = A for each j > 1. Evidently, the case where dim(Aj) = 1 for j > 1 simply corresponds to 
the setting in which each coefficient is penalized in exactly the same way; permitting the dimension of 
Aj to exceed one allows the penalty to depend on additional parameters (e.g., weights, such as in the 
case of the adaptive lasso considered in IZoul (120061) ). We are interested in problems with penalization; 
therefore, A is assumed bounded and strictly positive throughout this paper. Several specific examples 
will be discussed below. For any bounded 6 with A > as the first element, and the remainder of 
collecting any additional parameters used to define the penalty, the scalar function p(r; 6) is assumed to 
satisfy the following condition: 

(PI) p(r;6) > for r > 0; p(0;6) = 0; p(r;6) is a continuously differentiable concave function with 
p'(r; 6) > for r > 0, and, p'(0+; 6) e [M Y ,M e ] for some finite M e > 0. 

Evidently, (PI) implies that p'(r;6) > for r € (0,Ke), where Kg > may be finite or infinite. The 
combination of the concavity and nonnegative derivative conditions thus imply that the penalty increases 
away from the origin, but with a decreasing rate of growth that may become zero. The case where © 
is identically zero for r > is ruled out by the positivity of the right derivative at the origin imposed in 
(PI); similarly, the concavity assumption also rules out the possibility of a strictly convex penalty term. 
Neither of these restrictions is particularly problematic. Our specific interest lies in the development 
of algorithms for estimation problems subject to a penalty singular at the origin. Were ([3]) absent, or 
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replaced by a strictly convex penalty term, the convexity of g(J3) implies Q can be minimized directly 
using any suitable convex optimization algorithm, such as that discussed in Theorem l3.2l below. 

Theorem 12. 1 [ establishes local convergence of the indicated class of MM algorithms for minimizing 
objective functions of the form ©. A proof is provided in Appendix IA.2[ where it is shown that 
conditions imposed in the statement of the theorem are sufficient conditions for the application of the 
general MM local convergence theory summarized in Appendix lA.il 

Theorem 2.1. Let g(fi) be convex and coercive and assume p(fi; A) satisfies both (O and condition (PI ). 
Let h(fi, a) > be a real-valued, continuous function of/3 and a that is continuously dijferentiable in yS 
for each a and satisfies h(fi, a) = when fi = a. Let 



q(J3,a;A) = Y j q(\p j \,\a j \;A j ), (4) 

7=1 

where q(r, s; 6) = p(s; 6) + p'(s; 6)(r - s)for r,s>0, and define 

«A08, a) = h{fi, a) + q(fi, a; A) - p(J3; A). 
Assume the set of stationary points Sfor £(j3),j3 e S is finite and isolated. Then: 

( i) %(fi) in (O is locally Lipschitz continuous and coercive; 

(ii) q(fi, a; A) - pifi; A) is either identically zero or non-negative for all J3 ± a; 

(Hi) f; SUR (fi, a) - %{fi) + ipifi, a) majorizes %{fi) and the MM algorithm derived from % SUR (fi, a) con- 
verges to a stationary point of %(fi) if £ s UR (fi, a) is uniquely minimized in for each a and at 
least one ofh(fi, a) or q(fi, or; A) — p(fi; A) is strictly positive for each fi + a. 

Condition (iii) of Theorem l2. 1 l establishes convergence under the assumption that ^ s UR (J3, a) strictly 
major izes £(/?) and has a unique minimizer in (I for each a. Such a uniqueness condition is shown by 
Vaidal (12005b to ensure convergence of the EM and MM algorithms to a stationary point under more 
restrictive differentiability conditions. Importantly, the assumption of globally strict majorization is 
only a sufficient condition for convergence; this condition is only important insofar as it guarantees a 
strict decrease in the objective function at every iteration. As can be seen from the proof, it is possible 
to relax this condition to locally strict majorization, in which ^ SUR (fi,a) majorizes with strict 
majorization being necessary only in an open neighborhood containing M(a). 



The use of the MM algorithm and selection of (0]) are motivated by the results IZou and Lil (12008): 
we refer the reader to Remark 13.11 below for further comments in this direction. The assumptions on 
g(JS) clearly cover the case of the linear and canonically parameterized generalized linear models upon 
setting g(fi) = -t(JS), where i(fi) denotes the corresponding loglikelihood function. Estimation under 
the semiparametric Cox regression model (|CoxL Il972h and accelerated failure time models are also 
covered upon setting g(fi) to be either the negative logarithm of the parti al likelihood function (e.g. , 



Andersen et a ll 1 19931. Thm VII.2 .1) or the Gehan objective function (e.g., iFygenson and Ritovll 19941 : 



Johnson and Strawderman . 20091) . 

The assumption (PI) on the penalty function covers a wide variety of popular and in teresting exam- 
ples; see Figured] for ill ustration. For example , the lasso (LAS; e.g..lTibshiranil . 119961) . adaptive lasso 
(ALAS; e.g.. lZoull2006 ). elastic net (EN; e.g.. IZou and Hastia. 12005 ). and adaptive elastic net (AEN; 
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e.g., Zou and Zhang . 2009h penalties are all recovered as special cases upon considering the combi- 
nation of © and the ridge-type penalty /le||/?|| 2 . Specifically, with Aj = (A,a>j) T for a>j > 0, taking 
p(r; Aj) = Aojf in © gives LAS {oi j = \,e = 0), ALAS (cjj > 0, e = 0), EN (coj = 1, e > 0) and the 
AEN (ajj > 0, e > 0) penalties. It is easy to see that selecting p(r; Aj) = Acojr also implies the equality 
of (f3]) and ©, a result relevant in both (ii) and (iii) of Theorem 12. 1 1 above. 




The proposed pen alty specification also covers the smoothly clipped absolute deviation (SCAD; 
e.g.. Fan and Lit 1200 lb penalty upon setting p(r;Aj) = ps(r,A,a) for each j > 1, where ps(r;A,a) is 
defined as the definite integral of 



(aA — u)+ 

p' s {u;A,a) = A[I(u < A) + ^- — ttt^u > A)] 



(5) 



(a - 1)A 

on the interval < u < r and some fixed value of a > 2 (e.g., a = 3.7). The resulting penalty function is 
continuously differentiable and concave on r £ [0, oo). The concavity of p$(-; A,a) on [0, oo), combined 
with /?s(0; A, a) = and the fact that /?^(0+; A, a) is finite, implies 

ps (r; A, a) < ps (s; A, a) + p' s (s; A, a)(r - s) (6) 

for each r,s > 0, the bound ary cases for r = and/or s = following from 
Hiriart-Urrut v and Lem arechal Remark 4. 1.2, p. 21). In other words, p$ (r; A, a) can be majorized 



by a linear function of r. 

The l asso penalty, i ts variants, and SCAD have received the greatest attention in the literature. More 
recently, IZhangl (|2007l ) introduced the minimax concave penalty (MCP), which similarly to SCAD is 
defined in terms of its derivative. Specifically, one takes p(r;Aj) = pM(r;A,a) for each j > 1 in ©, 
where Pm(i", A, a) is defined for a > 2 as the definite integral of 



p' M (u;A,a) = 



(7) 



on the interval < u < r and some fixed value of a > 2 (e.g., a - 3.7 as in lFan et al.l. l2009b). Further 
examples of differentiable concave penalties include p(r; Aj) = pair; A, 8) for 

6r 



p G (r;A,6) = A 



l+5r 



, 6>0 



(8) 
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(e.g. lGeman and Reynoldslll992l : lNikoloval . l2000h : and p(r;Aj) = p Y (r, A,S) for 

p Y (r, A,5) = A \og{5r + 1), 5 > 0; 



(9) 



(e.g. Antoniadis et al. , 20091) . These penalties represent just a small sample of the set of possible penal- 
ties satisfying (PI) that one might reasonably consider. 

Remark 2.2. The SCAD and MCP penalties are not strictly concave and lead to surrogate majorizers 
that fail to satisfy the globally strict majorization condition in ( Hi) of Theorem \2.1\ unless h(fi, a) is 
strictly positive whenever + a; see Remark 1X7] for further discussion and also Theorem \3. 4\ below. 



3 MIST: Minimization by Iterated Soft Thresholding 
3.1 The MIST algorithm 

In general, the statistical literature on penalized estimation has proposed optimization algorithms tai- 
lored for specific combinations of fidelity and penalty functions. The class of MM algorithms suggested 
by Theorem l2. 1 I provides a very general and useful framework for proposing new algorithms, the key to 
which is a methodology for solving the minimization problem (0, a step repeated with each iteration of 
the MM algorithm. In this regard, it is helpful to note that the problem of minimizing g s UR (J3, a) for a 
given a is equivalent to minimizing 



#(/?) + As\m 2 2 + h(fi, a) + J p'Qaj\; Aj)\/3j\ 



(10) 



in fi. In particular, if g(fi) + /te|^8|| 2 + h(J3, a) is strictly convex for each bounded or, which clearly occurs 
if both g(J3) and h(fi, a) are convex in j5 and at least one is strictly convex, then (fTTJl) is also strictly 
convex and the corresponding minimization problem has a unique solution. 

Remark 3.1. For e = h(fi, a) - and g(J3) = -((J3)for €{fi) = £" = i b(fi) w ^ tn b(fi) a twice continuously 
differentiate loglikelihood function, the majorizer used by the MM algorithm induced by th e surrogate 



functi on (1101 ) corresponds ( up to sign ) to the minorizer employed in the LLA algorithm o f Zou and Li 



Zou and Li 



(2008), an improvement on the so-called LQA algorithm proposed i mHunter and Li n200$) . 
A2008L Proposition 1 ) assert convergence of their LLA algorithm under imprecisely stated assumptions 
and a re additionally un clear as to the nature of convergence result actually estabished. For example, 
while Zou and Li moh Theorem 1 ) demonstrate that the LLA algorithm does indeed have an ascent 
property, their result appears to be insufficient to ensure that the proper analog of condition Z3( ii) of 
Theorem IA. 1 1 holds in the case of the SC AD penalty . As a consequence, it is unclear whether even weak 
"subsequence " convergence results ( cf. \Wu\ \1983\) hold with useful generality in the case of the LLA 



algorithm. In contrast, Theorem 12. 1 1 shows that strict majorization, under a few precisely stated condi- 
tions, is sufficient to ensure local convergence of the resulting MM algorithm to a stationary point of Q 
In Section \3~2\ it is further demonstrated how a particular choice ofh(fi, a) yields a strict majorizer that 
permits both closed form minimization and componentwise updating at each step of the MM algorithm, 
even in the case of penalties that fail to be strictly concave. 

Numerous methods exist for minimizing a differentiable convex objective function (e.g., 
Boyd and Vandenbergheu2004t) . However, because (fTOl is not differentiable, such methods do not apply 



6 



in the current setting. Specialized methods exist for nonsmooth proble ms of the form (PTOl) in settings 
where gift) has a special structure; a well-known example here is LARS (Efron et all 12004) . which can 
be used to efficiently solve lasso-type problems in the case where gifi) is replaced by a least squares ob- 
jective function. Recently. ICombettes and Wajsl (120051 . Proposition 3.1; Theorem 3.4) proposed a very 
general class of fixed point algorithms for minim izing f](h)+ f?(h ), where /;(•), i= 1, 2 are each convex 
and h takes values in some r eal Hilbert space "H. lHale et all (I2008L Theorem 4.5) specialize the results of 
Combettes and Wai sl d2005l) to the case where *H is some subset of W and /2(h) — \ n i\- The collec- 



tive application of these results to the problem of minimizing (fTOl t generates an iterated soft-thresholding 
procedure with an appealingly simple structure. Theorem 13.21 given below, states the algorithm, along 
with conditions under which the algorithm is guaranteed to converge; a proof is provided in Appendix 
IA.3I The resulting class of procedures, that is, MM algorithms in which the minimization of (TTOb is car- 
ried out via this iterated soft-thresholding procedure, is hereafter referred to as MIST, an acronym for 
(M)inimization by (I)terated (S)oft (T)hresholding. Two important and useful features of MIST include 
the absence of high-dimensional matrix inversion and the ability to update each individual parameter 
separately. 

Theorem 3.2. Suppose the conditions of Theorem 12. 1 1 hold. Let m(fi) = g{J3) + h(fi,a) + /le|^8|| 2 be 
strictly convex with a Lipschitz continuous derivative of order L~ l > for each bounded a. Then, for 
any such a and a constant tu € (0, 2L), the unique minimizer of (1101 ) can be obtained in a finite number 
of iterations using iterated soft-thresholding: 

1. Set n = 1 and initialize 

2. Compute d (n) = b ( ' ,_1) - -nrVmCb^" 1 )) 

3. Compute b^ 1 ' = S(d^; vjt), where for any vectors u, v e W\ 



S(u;\) = 2_j s(uj,vj)ej, (11) 
7=1 

e j denotes the j unit vector for R p , 

s(uj,vj) = sign(uj)(\uj\ - vj)+, (12) 
is the univariate soft-thresholding operator, and 

T = (p'(\a l \;A l ),...,p'(\a p \;A p )) T . 

4. Stop if converged; else, set n — n + 1 and return to Step 2. 

The proposed algorithm, as originally developed in lCombettes and Wajsl ( 2005 ). is suitable for min- 
imizing the sum of a differentiable convex function m(-) and another convex function; hence, under 
similar conditions, one could employ this algorithm directly to minimize ( El) in cases where th e penalty 
© is derived from some convex function p(-;6). Theorem 3.4 of ICombettes and Wai si d2005b further 
shows that the update in Step 3 can be generalized to 



b ( "- y) + 5 n [S {b in ' l) - w n Vm(y> in - l) );w n T) - b ( " _1) ] , 
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where, for every n, m n € (0, 2L) and 5 n e (0, 1] is a suitable sequence of relaxation constants. Judicious 
selection of these constants, possibly updated at each step, may improve the convergence rate of this 
algorithm. 

Theorem 13.21 imposes the relatively strong condit ion that the gradient of m(fi ) is L _1 -Lipschitz con- 



tinuous. The role of this condition, also imposed in lCombettes and Wa is (2005, Proposition 3.1; The- 
orem 3.4), is to ensure that the update at each step of the proposed algorithm is a contraction, thereby 
guaranteeing its convergence to a fixed point. To see this, note that the update from b (n) to b^" +1) in the 
algorithm of Theorem 13.21 involves the mapping S(b - crVm(b); err) . For any bounded b and a, it is 
easily shown that 

||5(b - tD-Vm(b); vjt) - S (a - crVm(a); tut) || < ||b - a - cr(Vm(b) - Vra(a)) ||. 

When Vm(b) = Vra(a), the right-hand side reduces to ||b - a||, and the resulting mapping is only nonex- 
pansive (i.e., not necessarily contractive). However, under strict convexity, this situation can occur only 
if b = a. In particular, suppose that b (M) + \s (n ~ X) ; then, Vm(b (n) ) + Vm(b ( " _1) ) and, using the mean value 
theorem, 

|| b (» + D _ b (n)|| = || 5 ( b W _ wV m(b (n) ); tut) - s(b {n - y) - EjVm^'"" 1 '); tut) \\ 
< ||7 - tuH(b (n \b {n - 1) )\\ ||b (,l) - b (n_1) ||, 

where H(b, a) - J^ 1 Vm(b + ?(a - b))dt. The assumption that the gradient of m(fi) is L^-Lipschitz 
continuous now implies that choosing tu as indicated guarantees ||7 - TuHib^,^"' 1 ^)]] < 1, thereby 
producing a contraction. 



In view of the generality of the Contraction Mapping Theorem (e.g., iLuenberger and Yd 2008, 
Thm. 10.2.1), it is possible to relax the requirement that Vm(jS) is globally L _1 -Lipschitz continuous 
provided that one selects a suitable starting point. The relevant extensio n is summa r ized i n the corollary 



below; one may prove this result in a manner similar to Theorem 4.5 of [Hale et a l. (2008). 



Corollary 3.3. Let the conditions of Theorem 12. 1 \ hold. Suppose a is a bounded vector and assume that 
m(J3) = g{fi) + h(fi, a) + Ae\\p\\ 2 is strictly convex and twice continuously differentiable. Then, for a given 
bounded a, there exists a unique minimizer /?*. Let Q,be a bounded convex set containing /?* and define 
A max (J3) to be the largest eigenvalue ofV 2 m(J3). Then, the algorithm of Theorem \3.2\ converses to fi* in 
a finite number of iterations provided that b' > € Q., A* = max^n A max {ji) < oo, and m € (0, 2/ A*). 

Some useful insight into the form of the proposed thresholding algorithm can be gained by consid- 
ering the behavior of the penalty derivative term p'(r; 6). As suggested earlier, (PI) implies that p'(r; 6) 
decreases from its maximum value towards zero as r moves away from the origin. For some penal- 
ties (e.g., SCAD, MCP), this derivative actually becomes zero at some finite value of r > 0, resulting 
in situations for which tj = p'(]aj\; Af) = for at least one j. If this occurs for component j, then 
/* component of the vector S^b*^ - crVm(b ( "'); wt) simply reduces to the component of the ar- 
gument vector b (,,) - inVm(b^). In the extreme case where t = 0, the proposed update reduces to 
b («+i) _ _ G7 Vm(b ( ' l) ), an inexact Newton step in which the inverse hessian matrix is replaced by 
crip, \ p denoting the p x p identity matrix, and with step-size chosen to ensure that this update yields a 
contraction. Hence, if each of the components of b (,i) - tzrVm(b^" ) ) are sufficiently large in magnitude, 
the proposed algorithm simply takes an inexact Newton step towards the solution; otherwise, one or 
more components of this Newton-like update are subject to soft-thresholding. 
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3.2 Penalized estimation for generalized linear models 

The combination of Theorems 12. l[|3.2l and Corollary [33] lead to a useful and stable class of algorithms 
with the ability to deal with a wide range of penalized regression problems. In settings where g(J3) is 
strictly convex and twice continuously differentiable, one can safely assume that h(J3, a) - for all 
choices of fi and a provided that p'(r; 6) in (PI) is strictly positive for r > 0; important examples of sta- 
tistical estimation problems here include many comm only used linear and generalized linear regression 



models, semiparametric Co x regression (iCoxl. 1 19721). and sm oothed versions of the accelerated failure 



time regression model (cf. iJohnson and Strawdermanl . l2009h . The SCAD and MCP penalizations, as 



well as other penalties having p'(r; 6) > for r > 0, can also be used; however, additional care is 
required. In particular, and as pointed out in an earlier remark, if one sets h(J3, a) = for all j5 and a 
then convergence of the resulting algorithm to a stationary point is no longer guaranteed by the above 
results due to the resulting failure of these penalties to induce strict majorization. 

The need to use an iterative algorithm for repeatedly minimizing (PTOl ) is not unusual for the class 
of MM algorithms. However, it turns out that for certain choices of g(J3), a suitable choice of h(fi, a) 
in Theorem 13.21 guarantees both strict majorization and permits one to minimize (fTOl) in a single iter- 
ation, resulting in a single soft-thresholding update at each iteration. Below, we demonstrate how the 
MIST algorithm simplifies in settings where g(JS) corresponds to the negative loglikelihood function of a 
canonically parameterized generalized linear regression model having a bounded hessian function. The 
result applies to all penalties satisfying condition (PI), including SCAD and MCP. A proof is provided 
in Appendix IA.4I 

Theorem 3.4. Lety be N x 1 and suppose the probability distribution of y follows a generalized linear 
model with a canonical link and linear predictor Xfi, where X = [1^,X] is Nx(p+ 1) and/3 = \J3q,/3 t ] t 
is {p + 1) X 1 with ySo denoting an intercept. Assume that g(fi) = —t(fi), where 

m=l T (c(y)-b(rj))+y T Tl 

denotes the corresponding loglikelihood; here, 7j = XJ3 and E\yi\ - b'(rji)for i - 1 . . . N for b(-) strictly 
convex and twice continuously differentiable. Let the penalty function be defined as in © and satisfy 
(PI); note that ySo remains unpenalized. Define 

h(fi,a) = ((ft) - £(a) - Vt(a) T (fi - a) + m~ l (fi- a? (ft - a); (13) 

where a - [cvq, a T ] T is (p + 1) X 1, and m is defined as in Corollarv \3.3\ Then: 

1. The objective function (O, say £ g i m (fi), is majorized by 

p 

+m- i (fi - af(fi - a) + ^(t^I + 7j + Aetf) (14) 

7=1 

where tj = p'QaA; A;) andyj - p(\a;\;Aj) — p'(\aj\; Aj)\aj\ are bounded, nonnegative, and func- 
tionally independent off}. 

2. The functions g(fi) - —£(JS) and h(fi, a) satisfy the regularity conditions of Theorems \2.1\ and \3.2\ 
hence, the corresponding MM algorithm converges to a stationary point of (0. 
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3. For each bounded a, 

(a) the minimizer fi* o/^™(j8, or) is unique and satisfies 



K, = ao + y[W(S)]o (15) 



where S (• ; •) is the soft-thresholding operator defined in (1111 ) and & = {1, . . . , p). 
(b) for each k = [k , k t ] t e 1&> +x \ 

? g ™ 08* + *,<*)> or) + m' 1 P|| 2 . (16) 

In view of previous results, the result in # 3 of Theorem 13 .4 1 shows that the resulting MM algorithm 

—iii) 

takes a very simple form: given the current iterate fi , 



(fl) 

1. update the unpenalized intercept fi : 



«(»+!) _ ffn) , CT 
^0 + o 



2. update the remaining parameters 



in). 



& n+1) = j^-^s + j [ve0 (n W jr^j , (17) 

where r {n) = (p'(\fi\ Xi) p'{\pf\\ A p )) T . 

The specific choice of function h(fi,a) clearly serves two useful purposes: (i) it leads to 
componentwise-soft thresholding; and, (ii) it leads to strict majorization, as is required in condition 
(iii) of Theorem 12.11 allowing one to establish the convergence of MIST for SCAD and other penalties 
having p'(r, 6) - at some finite r > 0. 

Evidently, the algorithm above immediately covers the setting of penalized linear regression. For 
example, suppose that y has been centered to remove /?o from consideration and that the problem has 
also been rescaled so that X, which is now N x p, satisfies the indicated conditions. Then, the results of 
the Theorem 13.41 apply directly with 

-£(fi) = \\\Xp-y\\\ W(y8) = X r (y - Xfi), h(fi,a) = m-\\p-a\\ 2 - l -\\Xp-Xat, 

where w is defined as in Corollary [331 For the class of adaptive elastic net penalties ( i.e., p(r; Xj) = AiOjr 
in ([3])), the resulting iterative scheme is exactly that proposed in ( De Mol et al. . 20081 pg. 17), specialized 
to the setting of a Euclidean parameter. In particular, we have tj = Acoj and yj = in Theorem I3.4[ and 
the proposed update reduces to 



P(k + 1) = 1 5 /(yj _ X / X) ft) + X / y . ^ f 

v + 2Ae v 



where v = 2m . Setting v = 1 and e - yields the iterative procedure proposed in iDaubechies et al. 



d2004l) . provided that X'X is scaled such that I -X'X is positive definite. The proposed MIST algorithm 
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extends these iterative componentwise soft-thresholding procedures to a much wider class of penalty 
and data fidelity functions. 

The restriction to canonical generalized linear models in Theorem 13.41 is imposed to ensure strict 
convexity of the negative loglikelihood. Our results are easily modified to handle non-canonical gener- 
alized linear models, provided the negative loglikelihood remains strictly convex in fi and the hessian 
can be appropriately bounded. Interestingly, not all canonically parameterized generalized linear mod- 
els satisfy the regularity conditions of Theorem |3.4| One such important class of problems is penalized 
likelihood estimation for Poisson regression models. For example, in the classical setting of N indepen- 
dent Poisson observations with Z?[F;|X,-] = dj exp{x ; r /?} for a known set of constants d\ . . . d^, we have 
(i.e., up to irrelevant constants) £(fi) - - £v.j fi(xffi), where 

fi(u) = dte 11 - ytu. 



It is easy to see that W(/?), hence Vm(J3), is locally but not globally Lipschitz continuous; hence, it is 
not possible to choose a matrix C = m~ l I such that (fT4l everywhere majorizes ^ g i m (fi). Nevertheless, 

progress remains possible. For example, Corollary 13.31 implies that that one can still use a single update 

— (0) 

of the form (fTTT ) provi ded that a suitable Q., hence C and j3 , can be identified. Alternatively, using 
results summarized in lBecker et all (|1997T) . one can instead majorize -t{fi) for any bounded a using 



k(fi, a) = J kj(J3f, aj) for kj(fifi try) - J] I{x u * 0} 9 U fi [%(fij- aj) + xja ] , 



1=1 



1 and Gtj > if x,j * 0. 



where, for every i, 9^- > are any sequence of constants satisfying Zy = o 
Of importance here is the fact kj(fif,aj) is a strictly convex function of /By and does not depend on 
for k + j. One may now take h(J3,a) in Theorem \2. II as being equal to k{fi,a) + €{fi), leading to the 
minimization of 



f VR (fi,a) cx Yjkjtff, aj ) + Asp) + p'Qajl; 10$ + k (p , or ). 



(18) 



In particular, componentwise soft-thresholding is replaced by componentwise minimization of (TT8T ). the 
latter being possible using any algorithm capable of minimizing a continuous nonlinear function of one 
variable. 



Remark 3.5. The Cox proportional hazards model I Cox 197% . while not a generalized linear model, 
shares the essential features of the generalized linear model utilized in Theorem \3.4\ In particular, the 
negative log partial likelihood, sa y g(/3) = —€ p (ff), is str i ctly convex, twice cont i nuous ly differentiable, 
and has a bounded hessian (e.g., Bohning and LindsaA. 198 A ' Andersen et al. 199% . Consequently, 
Theorem \3.4\ and its proof are easily modified for this setting upon taking g(fi) as indicated, setting 
h(fi, a) = l p (fi) - £ p (a) - V£ p (a) T (fi -a) + m" l \\fi- a\\ 2 , and taking m as defined as in Corollary \3.3\ 



3.3 Accelerating Convergence 

Similarly to the EM algorithm, the stability and simplicity of the MM algorithm frequently comes 
at the price of a slow convergence rate. Numerous methods of accel erating the EM algorithm 
have been proposed in the literature; see iMcLachlan and Krishnan (2008) for a review. Recently, 
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Varadhan and Roland (2008) proposed a new method for EM called SQUAREM, obtained by "squaring" 
an iterative Steffensen-type (STEM) acceleration method. Both STEM and SQUAREM are structured 
for use with iterative mappings of the form 6 n+ \ = M{9„), n - 0, 1, 2, . . . , hence applicable to both the 
EM and MM algorithms. Specifically, the acceleration update for SQUAREM is given by 



9„ +1 = 9 n - 2y n (M(0 n ) - 9 n ) + yi[M(M(9 n )) - 2M(6 n ) + 9 n ] 
= 9 n - 2y n r n + y 2 n v n , 



(19) 



where r n = M(9„) — 9 n and v„ = (M(M(9 n )) - M{9 n )) - r n for an adaptive steplength y n . 



Varadhan and Rolandl d2008b suggest se veral steplength options, with preference for choice y n = 
vJI. iRoland and Varadhanl (2005) provide a pr oof of local convergence for SQUAREM under 



- r, 



restrictive conditions on the EM mapping M{9), while IVaradhan and Roland (2008) outline a proof for 
global convergence for versions of SQUAREM that employ a back- tracking strategy. We study the 
effectiveness of SQUAREM applied to the simplified form of the MIST algorithm, hereafter denoted 
SQUAREM 2 , in Section^ 



4 Simulation Results 

The simulation results summarized below are intended to compare the estimates of fi obtained from 
existing methods to those obtained using the simplified MIST algorithm of Theorem 13-41 In partic- 
ular, we consider the performance of MIST for the class of penalized linear and generalized linear 
models, demonstrating its capability of recovering the solutions provided by existing algorithms when 
both algorithms are forced to use the same set of "tuning" parameters (i.e., penalty parameter(s), plus 
any additional parameters required to define the penalty itself). In cases where multiple local minima 
can arise, we further show that the MIST algorithm often tends to find solutions with lower objective 
function evaluations in comparison with existing algorithms, provided these algorithms utilize the same 
choice of starting value. 



4.1 Example 1: Linear Model 

Let \ m and (L , respe ctively denote m-dimensional vectors of ones and zeros. Then, following 



Zou and Zhan g (2009), we generated data from the linear regression model 

y = x'J3* + s (20) 

where /?* = (3 • \ T q , ^ T p - q ) T is a /^-dimensional vector with intrinsic dimension q = 3[p/9], s ~ N{0, o 2 ), 
and x follows a p-dimensional multivariate normal distribution with zero mean and covariance matrix E 
having elements = p lj ~ kl , 1 <k,j < p. We considered cr e {1, 3}, p e {0.0, 0.5, 0.75} and p € {35, 81} 
for Af = 100 independent observations. 

Penalized least squares estimation is considered for five p opular choices of penalty function s, all of 
which are currently implemented in the R software language dR Development Core TeamLl2005l) : LAS, 



ALAS, EN, AEN, and SCAD. The LAS, ALAS, EN and AEN penalties are all convex and lead to 
unique solutions under mild conditions; the SCAD penalty is concave and the resulting minimization 
problem may have multiple solutions. In each case, we used existing software for computing solutions 
subject to these penalizations and compared those results to the solutions computed using the MIST 
algorithm. 
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Regarding existing methods, we respectively used the lars (IHastie and EfronL 120071) and elasticnet 
(IZou and Hastid. |2008) packages for computing solutions in the case of the LAS and EN penalties. 
For the ALAS and AEN penalties, we used software kindly provided by lZou and Zhangl (|2009) that also 
makes use of the elasticnet package. The weights for the AEN penalty are computed using oil = \6 EN \~ y , 

*EN 

j = 1 , . . . , p, where p is an EN estimator and 7 is a positive constant. Using EN-based weights in 
the AEN fitting algor ithm necessitates tuning parameter specification for both EN and AEN. As in 



Zou and Zhangl (120091) . the l\ parameters A (Ai in their notation) are allowed to differ, whereas the I2 
parameters e (A2 in their notation) are forced to be the same. Evidently, setting e = (Ap = 0) results 
in the ALAS solution. For the SCAD penalty, we consi dered the es t imator of iKim et all (120081) (HD), 
as well the one-step SCAD (IS) and LLA estimators of IZou and Lil (|2008h . The code for the first two 
methods was kindly provided by their respective authors; the LLA estimator was computed using the 
R package 5/5. The choice a = 3.7 was used for all implementations of SCAD. 

We considered finding solutions using penalties in the set A = {0.1, 1,5, 10,20, 100}. In particular, 
for LAS and SCAD, A - Ai e A. For EN, both A - A\ e A and Ae - A% 6 A. For simplicity, we 

~EN 

fixed the weights for AEN for a given A2 by selecting the 'best' j3 among the six estimators involving 
A — A\ e A based on a BIC-like criteria. Likewise for ALAS, the weights were computing using the 
'best' fi LAS among the six estimators involving A = A\ e A. The parameter y for the ALAS and AEN 
penalties was respectively set to three and five for p = 35 and p = 8 1 . 

For the strictly convex objective functions associated with the LAS, ALAS, EN, and AEN penalties, 
we simply used a starting value of /? (0) = p . For SCAD, three different starting values for the MIST, HD, 
and LLA SCAD algorithms were considered: = Q p , j8 (0) = p mi (i.e., the unpenalized least squares 
estimate), and jS (0) = p xs A (i.e., the one-step estimate computed using the penalty A). As in lZou and Li 



(2008), the one-step estimator IS is computed using fi m[ , an appropriate choice when N > p. 

The convergence criteria used by the existing software packages were used without alteration in this 
simulation study. The convergence criteria used for MIST were as follows: the algorithm stopped if 
either (i) the normed difference of successive iterates was less than 10~ 6 (convergence of coefficients); 
or, (ii) the difference of the objective function evaluated at successive iterates was less than 10~ 6 and the 
number of iterations exceeded 10 6 (convergence of optimization). Due to the large number of compar- 
isons and highly intensive nature of the computations, we ran B = 100 simulations for each choice of p, 
o~, and p. We report the results for the convex penalties in Table Q] and those for the SCAD penalty in 
Tables |2]and|3] 

In Tabled] we summarize the average normed difference between the solution obtained using exist- 
ing software and that obtained using MIST, \\p exist -p mi J , over the B - 100 simulations; in particular, 
we report in the two leftmost panels the maximum value of this difference, computed across all com- 
binations of tuning parameters. These maximum differences (all of which are multiplied by 10 5 ) are 
remarkably small for all (A)LAS and (A)EN penalties, indicating that MIST recovers the same (unique) 
solutions as the existing algorithms. Interestingly, the values for LAS are slightly larger than the rest, 
where the maximum differences all resulted from the smallest value of A considered (A = 0.1). In 
these cases, the algorithm tended to stop using the objective function criteria rather than the (stricter) 
coefficient criteria, resulting in slightly larger differences on average. 

The results for SCAD are reported in Tables |2](/? = 35) and[3](p = 81) and display (i) the average 
normed differences, multiplied by 10 3 , for each combination of A, p, <r, p and stalling value; and, (ii) the 
proportion of simulated datasets in which the MIST solution yields a lower evaluation of the objective 
function in comparison with the solution obtained using another method for the indicated choice of 



13 



Table 1: Maximum average normed differences (xlO 5 ) over B = 100 simulations for Examples 1 (LM) and 2 
(GLM). 



p 


LM : cr 


= 1 


LM : cr - 


3 


GLM 





0.5 


0.75 





0.5 


0.75 







0.5 


0.75 


p — 35 
















= 25 






LAS 


0.10 


0.35 


1.45 


0.10 


0.37 


1.56 




0.07 


4.28 


6.17 


ALAS 


0.03 


0.14 


0.64 


0.05 


0.21 


1.00 




1.84 


2.86 


3.76 


EN 


0.07 


0.19 


0.50 


0.07 


0.20 


0.51 




2.30 


5.61 


8.68 


AEN 


0.03 


0.10 


0.33 


0.04 


0.13 


0.36 




1.47 


3.35 


5.27 


p = 81 














q 


= 75 






LAS 


1.73 


3.82 


11.76 


2.33 


5.78 


18.99 




0.10 


6.97 


9.94 


ALAS 


0.12 


0.38 


1.58 


0.35 


1.03 


4.39 




1.34 


2.55 


3.30 


EN 


0.31 


0.49 


0.87 


0.31 


0.49 


0.88 




2.35 


4.64 


6.56 


AEN 


0.14 


0.22 


0.56 


0.16 


0.26 


0.56 




1.27 


2.29 


2.85 



starting value. The results for A = 100 are not shown, as the solution was p in all cases. In comparison 
with the convex penalties, larger normed differences are observed, even when controlling for the use of 
the same starting value. Such differences are a result of two important features of the SCAD optimization 
problem: (i) the possible existence of several local minima; and, (ii) the fact that the MIST, HD, and 
LLA algorithms each take a different path from a given starting value towards one of these solutions. 
For example, while each of the LLA, MIST, and HD algorithms involve majorization of the objective 
function using a lasso-type surrogate objective function, both the majorization and minimization of the 
resulting surrogate function are carried out differently in each case. In particular, the LLA algorithm, as 
implemented in SIS, majorizes only the penalty term and adapts the lasso code in glmpath in order to 
minimize the corresponding surrogate objective function at each step. The HD algorithm is similar in 
spirit, but instead d ecomposes the penalty t erm into a sum of a concave and convex function and utilizes 
the the algorithm of iRosset and Zhul (|2007T) to minimize the corresponding surrogate objective function. 
The MIST algorithm instead uses the same penalty majorization as the LLA algorithm, but additionally 
majorizes the negative loglikelihood term in a way that permits minimization of the surrogate function 
in a single soft-thresholding step. Each procedure therefore takes a different path towards a solution, 
even when given the same starting value. 

We remark here that differences must also expected between any of LLA, HD, MIST and the one- 
step solution IS; from an optimization perspective, the one-step estimate is the result of runni ng just one 
iterat ion of the LLA algorithm, starting from the unpenalized least squares estimator J3 mj (|Zou and Li . 



2008), and only provides an approximation the solution to the desired minimization problem. All other 



methods (LLA, MIST, HD) iterate until some local minimizer (or stationary point) is reached. For ex- 
ample, when using either (l ml or fi ls A as the starting value, MIST always found a solution that produced 
a lower evaluation of the objective function in comparison to A . However, when using the null start- 
ing value of P , the one-step estimator did occasionally result in a lower objective function evaluation in 
cases involving smaller values of A. This behavior is not terribly surprising; with small A, the one-step 
solution should generally be close to the unpenalized least squares solution, as the objective function 
itself is likely to be dominated by the least squares term. 

Of all the SCAD algorithms considered here, MIST and LLA tended to find the most similar solu- 
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tions (i.e., have the smallest normed differences). For the cases in which the LLA solution had lower 
objective function evaluations, all of the MIST solutions were also LLA solutions; i.e, when starting the 
LLA algorithm with the MIST solution, the algorithm terminated at the starting value (i.e., the LLA so- 
lution coincides with the MIST solution). With the exception of three of these cases, starting the MIST 
algorithm with the LLA solution also resulted in the same behavior. For the most part, the HD and MIST 
algorithms also gave similar results, with one source of difference being the respective stopping criteria 
used. The stopping criteria for HD, assessed in order, are as follows: (1) 'convergence of optimization': 
stop if the absolute value of the difference of the objective evaluated at successive iterates is less than 
le-6; (2) 'convergence of penalty gradient': stop if the sum of the absolute value of the differences of 
the derivative of the centered penalty evaluated at successive iterates is less than le-6, (3) 'convergence 
of coefficients:' stop if the sum of the absolute value of the differences of successive iterates is less than 
le-6, and (4) 'jump-over' criteria: stop if the objective at the previous iterate plus le-6 was less than the 
objective at the current iterate. After careful analysis of the results, we can assert the following: 

• The MIST solution usually has the same or a lower evaluation of the objective function in com- 
parison with HD, regardless of starting value. 

• HD tends to have the greatest difficulty in cases of high correlation between predictors, a likely 
result of the fact that this algorithm relies on the variance of the unpenalized least squares estima- 
tor, hence matrix inversion, to take steps towards solution. In contrast, MIST requires no matrix 
inversion. 

On balance, the MIST algorithm performs as well or better than LLA and HD in locating minimizers 
in nearly all cases. As suggested above, variation in the solutions found can be traced to the path each 
algorithm takes towards a solution and differences in stopping criteria. Remarkably, in cases when the 
correlation among predictors was low, the choice of starting value made little difference for MIST; either 
the same solution was found for all starting values or none of the starting values dominated in terms of 
finding the lower or equivalent objective evaluations. In settings involving higher correlation, however, 
using either P or the IS starting values tended to result in the lower evaluations of the objective function 
in comparison with using the unpenalized least squares solution. Similar behavior was observed for the 
LLA algorithm. In contrast, the choice of starting value had a much larger impact on the performance of 
the HD estimator; in particular, the use of p as a starting value typically resulted in the lowest objective 
function evaluations when compared to using a non-null starting value. 



4.2 Example 2: Binary Logistic Regression 



As in Example 1, we considered the LAS, ALAS, EN, AEN, and SCAD penalties. There are at least 



two R packages that allow penalization using the LAS and EN penalties: glmpath (IPark and Hastie , 
2007 ), which handles b inomial and poisson regression using a "predictor-corrector" method, and glmnet 
(Frie dman et al. , 12008). which handles binomial and multinomial regression using cyclical coordinate 
descent. Both methods can be tuned to find the same solutions, so for ease of presentation we only con- 



sider the results of glmnet for comparison in the tables and analysis below. The SIS package dFan et al. 



2009a) permits compu tations with the ALAS, AEN, and SCAD penalties using modifications of the 
Park and Hastid (120071) code. For S CAD, we compare d the results of MIST to the results from the one- 
step (IS) algorithm (GLM version . IZou and LiL 120 08) using the code provided from the authors and the 
LLA algorithm as implemented in lFan et al. d2009ah . 
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Table 2: Algorithm performance in Example 1 (LM: p = 35, N = 100) for SCAD penalty. The column 'avg' is 
the average normed differences xlO 3 between the MIST solution and the existing method's solution ; 'prop' is the 
proportion of MIST solutions whose objective function evaluation was less than or equal to that of the existing 
method's solution. 



Jf0) 
p method 



0„ 



avg prop 



a ml 



avg prop 



As. 



avg prop 



0„ 



avg prop 



avg prop 



avg 



A = .l 







0.5 



0.75 



HD 

IS 

LLA 
HD 

IS 

LLA 
HD 

IS 

LLA 



15.71 
99.13 
0.43 
7.07 
192.22 
6.65 
29.25 
575.09 
23.81 



1.00 
1.00 
1.00 
0.99 
1.00 
0.99 
0.99 
1.00 
0.98 



15.41 
99.13 
0.46 
10.72 
192.01 
0.62 
105.39 
488.09 
23.34 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
0.92 
1.00 
0.99 



17.93 
99.13 
0.46 
2.04 
192.00 
0.60 
66.83 
486.19 
1.67 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
0.96 
1.00 
0.99 



468.55 
211.17 
2.07 
269.85 
483.89 
57.87 
2335.23 
1417.97 
558.56 



1.00 
1.00 
1.00 
0.97 
0.98 
0.96 
1.00 
0.86 
0.73 



2076.40 
211.16 
1.96 
218.94 
421.17 
12.84 
2758.43 
604.26 
69.30 



1.00 
1.00 
1.00 
0.94 
1.00 
0.99 
0.98 
1.00 
0.96 



55.17 
211.16 
2.02 
130.76 
419.15 
2.37 
2731.10 
629.21 
44.87 



A= 1 



0.5 



0.75 



HD 

IS 

LLA 
HD 

IS 

LLA 
HD 

IS 

LLA 



6.22 
694.59 
1.64 
300.62 
4489.01 
296.53 
3083.00 
7224.77 
2802.66 



1.00 
1.00 
1.00 
0.98 
1.00 
0.98 
0.68 
1.00 
0.66 



22.87 
694.57 
1.71 
34.09 
4276.77 
7.10 
1980.40 
5491.09 
1121.80 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
0.89 
1.00 
0.85 



19.99 
694.57 
1.74 
115.76 
4261.64 
88.14 
1138.53 
5622.21 
293.50 



1.00 
1.00 
1.00 
0.98 
1.00 
0.98 
0.96 
1.00 
0.96 



9.44 
844.68 
1.47 
303.98 
3547.69 
248.82 
1476.59 
5682.04 
1365.76 



1.00 
1.00 
1.00 
0.96 
1.00 
0.96 
0.84 
0.96 
0.83 



35.16 
844.67 
1.47 
140.26 
3254.16 
2.66 
1669.60 
3835.30 
918.63 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
0.93 
1.00 
0.89 



14.65 
844.67 
1.43 
94.90 
3254.16 
2.66 
868.21 
3748.35 
433.66 



-1 = 5 



0.5 



0.75 



HD 

IS 

LLA 
HD 

IS 

LLA 
HD 

IS 

LLA 



18.18 
48.23 
0.00 
0.01 
3696.85 
0.02 
0.27 
3977.93 
0.27 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



18.18 
48.23 
0.00 
0.01 
3696.85 
0.09 
0.27 
3977.93 
0.45 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



18.18 
48.23 
0.00 
0.01 
3696.85 
0.08 
98.05 
4045.81 
98.35 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



17.73 
63.63 
0.00 
0.01 
3751.96 
0.03 
19.20 
4170.49 
19.00 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
0.99 
1.00 
0.99 



17.73 
63.63 
0.00 
0.01 
3751.96 
0.14 
19.21 
4170.49 
19.20 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
0.99 
1.00 
0.99 



17.73 
63.63 
0.00 
0.01 
3751.96 
0.08 
99.95 
4180.79 
100.05 



A= 10 
HD 
IS 

LLA 
HD 

IS 

LLA 
HD 

IS 

LLA 



0.5 



0.75 



0.00 
0.00 
0.00 
57.33 
501.86 
0.01 
0.41 
4206.65 
0.09 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
57.33 
501.86 
0.03 
0.41 
4206.65 
0.30 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
57.33 
501.86 
0.01 
0.41 
4206.65 
0.14 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
53.80 
497.87 
0.01 
0.53 
4261.12 
0.07 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
53.80 
497.87 
0.04 
0.53 
4261.12 
0.36 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
53.80 
497.87 
0.01 
0.53 
4261.12 
0.10 



A = 20 



0.5 



0.75 



HD 

IS 

LLA 
HD 

IS 

LLA 
HD 

IS 

LLA 



0.00 
0.00 
0.00 
0.00 
0.00 
0.00 
33.90 
47.21 
0.00 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
0.00 
0.00 
0.00 
33.90 
47.21 
0.00 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
0.00 
0.00 
0.00 
33.90 
47.21 
0.00 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
0.00 
0.00 
0.00 
35.46 
46.90 
0.00 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
0.00 
0.00 
0.00 
35.46 
46.90 
0.06 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
0.00 
0.00 
0.00 
35.46 
46.90 
0.00 
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Table 3: Algorithm performance in Example 1 (LM: p = 81, AT = 100) for SCAD penalty. The column 'avg' is 
the average normed differences (xlO 3 ) between the MIST solution and the existing method's solution ; 'prop' is 
the proportion of MIST solutions whose objective function evaluation was less than or equal to that of the existing 
method's solution. 



0(0) 
p method 



0„ 



avg prop 



P 



'ml 



avg prop 



avg prop 



0„ 



avg prop 



avg prop 



Pis,. 



avg 







0.5 



= .1 
HD 
IS 

LLA 
HD 

IS 

LLA 
0.75 HD 
IS 

LLA 



828.22 
753.85 
1.60 
5992.88 
1217.02 
24.78 
12018.61 
2492.18 
36.95 



1.00 
1.00 
1.00 
1.00 
1.00 
0.97 
1.00 
1.00 
0.98 



1211.97 
753.84 
1.67 
6008.14 
1202.01 
1.33 
12042.97 
2327.76 
90.89 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
0.97 



962.10 
753.84 
1.64 
5994.86 
1201.30 
8.50 
12042.90 
2330.54 
90.69 



1.00 
1.00 
1.00 
1.00 
1.00 
0.99 
1.00 
1.00 
0.96 



4615.10 
2836.29 
1181.62 
8002.08 
4619.22 
2123.22 
13582.93 
8204.45 
3517.93 



1.00 
0.90 
0.76 
1.00 
0.88 
0.57 
1.00 
0.60 
0.50 



5414.49 
1314.46 

382.17 
9530.30 
1473.61 

576.65 
16580.85 
1215.98 

607.08 



1.00 
1.00 
0.82 
1.00 
1.00 
0.83 
1.00 
1.00 
0.78 



5350.54 
1366.62 

223.32 
9546.21 
1403.36 

232.10 
16569.80 
1181.16 

252.75 



A= 1 
HD 
IS 

LLA 
0.5 HD 

IS 

LLA 
0.75 HD 
IS 

LLA 



1421.70 

7121.11 
50.48 

4505.31 
11973.29 

1622.24 
11166.35 
16953.51 

6379.56 



1.00 
1.00 
0.99 
0.93 
1.00 
0.89 
0.75 
1.00 
0.50 



3595.88 
6977.35 
64.69 
6764.71 
10301.59 
661.69 
16786.90 
9125.82 
4295.69 



1.00 
1.00 
0.99 
0.88 
1.00 
0.95 
0.57 
1.00 
0.63 



2296.03 
6976.16 
4.59 
4973.51 
10238.21 
622.25 
11642.59 
9225.76 
787.30 



1.00 
1.00 
1.00 
0.98 
1.00 
0.96 
0.84 
1.00 
0.93 



1552.11 

7485.99 
231.48 

4571.62 
12411.82 

1682.66 
12834.39 
17174.91 

6904.11 



0.98 
1.00 
0.97 
0.97 
1.00 
0.89 
0.81 
0.99 
0.52 



3258.39 
7182.76 
107.36 
6473.05 
9674.64 
1785.73 
14964.11 
8828.81 
3637.68 



1.00 
1.00 
1.00 
0.89 
1.00 
0.86 
0.66 
1.00 
0.74 



2231.63 
7182.76 

140.97 
6150.70 
9781.43 

517.91 
10110.16 
8549.86 

812.28 







0.5 



= 5 
HD 
IS 

LLA 
HD 

IS 

LLA 
0.75 HD 

IS 

LLA 



12.35 
1072.70 
0.01 
28.71 
6793.73 
0.38 
4998.08 
11191.83 
1217.39 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
0.88 
1.00 
0.90 



12.35 
1072.70 
0.05 
28.71 
6793.73 
0.54 
4963.08 
11188.02 
1252.65 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
0.88 
1.00 
0.89 



12.35 
1072.70 
0.01 
28.71 
6793.73 
0.49 
4292.65 
12029.12 
1060.08 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
0.97 
1.00 
0.99 



13.00 
1114.13 
0.01 
0.43 
6831.01 
0.43 
5753.61 
11917.77 
861.72 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
0.92 
1.00 
0.95 



13.00 
1114.13 
0.07 
0.42 
6831.01 
0.58 
5772.76 
11971.47 
937.76 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
0.95 
1.00 
0.94 



13.00 
1114.13 
0.01 
0.43 
6831.01 
0.57 
5192.19 
12485.14 
1018.59 



A= 10 
HD 
IS 

LLA 
0.5 HD 

IS 

LLA 
0.75 HD 
IS 

LLA 



0.00 
0.00 
0.00 
6.69 

2883.52 
0.03 
122.19 

8835.88 
0.08 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
6.69 

2883.52 
0.20 
122.19 

8835.88 
0.54 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
6.69 

2883.52 
0.03 
122.19 

8835.87 
0.32 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
5.80 

2906.35 
0.02 
107.93 

8874.85 
0.10 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
5.80 

2906.35 
0.20 
107.93 

8874.85 
0.53 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
5.80 

2906.35 
0.02 
107.93 

8874.84 
0.35 







0.5 



= 20 
HD 
IS 

LLA 
HD 

IS 

LLA 
0.75 HD 
IS 

LLA 



0.00 
0.00 
0.00 
0.00 
0.00 
0.00 
21.76 
3997.88 
0.05 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
0.00 
0.00 
0.00 
21.76 
3997.88 
0.43 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
0.00 
0.00 
0.00 
21.76 
3997.88 
0.06 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
0.00 
0.00 
0.00 
17.70 
4014.29 
0.07 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
0.00 
0.00 
0.00 
17.70 
4014.30 
0.38 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.00 
0.00 
0.00 
0.00 
0.00 
0.00 
17.70 
4014.29 
0.08 
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As before, we only considered comparing solutions that use the same combination of tuning parame- 
ters; for the present example, the set considered here is A = {0.001, 0.01, 0.05, 0.1, 0.2, 1.00}, reflecting a 
need to accommodate the different scaling of the pr oblem. The data gene ration scheme for this example 
was loosely based on the simulation study found in Friedman et al.1 (120081) . Binary response data were 
generated according to a logistic (rather than linear) regression model using p { = [1 +exp(-x^8*XT 1 , i = 
1, . . . , TV = 1000, where ft* is a ^-vector with elements Bj = 3 x (-iy exp(-2(j - l)/200), j-l,...,q, 
q e {25,75}, and the remaining 100 - q components set to zero. Here, x, follows a /^-dimensional 
multivariate normal distribution with zero mean and covariance £ = 3~ 2 P where correlation matrix P 
is such that each pair of predictors has the same population correlation p. We considered three such 
correlations, p e {0.0,0.5,0.75}. 

For the B - 100 simulations, the maximum (across different tuning parameters) average normed 
difference between the existing and proposed methods, multiplied by 10 s , are reported for each of the 
strictly convex cases in the right-most panel of Table [TJ As before, these maximums are generally 
remarkably small, indicating that MIST can recover the same (unique) solutions as the existing algo- 
rithms. The results for SCAD are reported in Table |4j which displays the same information as in the 
corresponding tables from Example 1 ; the HD comparisons are omitted here as the methodology and 
code were only developed for the cas e of penalized leas t-squares. In the GLM setting, the IS estimator 
is computed by applying the LARS (|Efron et all 120041) algorithm to a quadratic approximation of the 
negative loglikelihood function evaluated at the MLE. Thus, IS takes 'one step' towards minimizing 
the objective function; in contrast, both MIST and LLA iterate until a stationary point, usually a local 
minimizer, is found. As in the linear model case, LLA uses glmpath to minimize the surrogate at each 
step, whereas the MIST algorithm uses a single application of the soft thresholding operator to minimize 
the surrogate at each step. 

In this example, the starting value carried even greater importance in comparison with the linear 
model setting. In particular, in the case of MIST, the combination of a p starting value and small 
penalty parameter led to solutions with objective function evaluations that were substantially larger in 
comparison with those obtained using either B m[ and B ls A . Such behavior may be directly attributed to 
the fact that the ML and IS stalling values either minimize or nearly minimize the negative loglikelihood 
portion of the objective function, the dominant term in the objective function when A is "small." In 
contrast, a P starting value led to the best minimization performance for "large" A; upon reflection, this 
is also not very surprising, since large penalties induce greater sparsity and p is the sparsest possible 
solution. 

There were a few settings in which the IS estimator resulted in a lower objective function evaluation 
in comparison with applying MIST started at B ml . This reflects the fact that several local minima can 
exist for non-convex penalties like SCAD. In addition, and as was observed before, using the IS solution 
as a starting value always led to MIST finding a solution with a lower evaluation of the objective function 
in comparison with that provided by the IS solution. Regarding the use of LLA, which also requires 
a starting value specification, we again examined the cases for which LLA resulted in lower objective 
function evaluations. For these cases, all MIST solutions were LLA solutions, and all LLA solutions 
were MIST solutions with the exception of one. Hence, both methods find valid, if often different, 
solutions, a behavior that we again attribute to the differences in paths taken towards a solution. 
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Table 4: Algorithm performance in Example 2 (GLM) for 
normed differences (xlO 3 ) between the MIST solution and the 
of MIST solutions whose objective function evaluation was 
solution. 



SCAD penalty. The column 'avg' is the average 
existing method's solution ; 'prop' is the proportion 
less than or equal to that of the existing method's 



25 



0m 

p method 



0„ 



avg prop 



A 



'ml 



avg prop 



I.V..1 



avg prop 



: 75 



o„ 



avg prop 



avg prop 



avg 



001 
IS 

LLA 
IS 
LLA 
0.75 IS 

LLA 







0.5 



26.50 
18.55 
33.90 
27.65 
56.29 
46.48 



0.27 
0.68 
0.15 
0.64 
0.04 
0.71 



0.39 
0.15 
0.08 
0.01 
0.06 
0.05 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.39 
0.13 
0.07 
0.00 
0.05 
0.00 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



31.70 
17.31 
35.43 
18.45 
42.85 
26.05 



0.42 
0.76 
0.26 
0.82 
0.23 
0.82 



0.22 
0.22 
0.10 
0.10 
0.05 
0.04 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



0.18 
0.11 
0.07 
0.00 
0.04 
0.00 







01 
IS 
LLA 
0.5 IS 

LLA 
0.75 IS 

LLA 



945.60 

416.15 
1082.65 

427.10 
1462.74 

548.07 



0.11 
0.64 
0.00 
0.72 
0.00 
0.79 



30.65 
5.49 

23.60 
1.33 

16.81 
1.71 



1.00 
0.93 
1.00 
0.99 
0.98 
0.97 



31.42 
1.86 

22.97 
0.03 

17.37 
0.82 



1.00 
0.99 
1.00 
1.00 
1.00 
1.00 



1318.20 
406.62 

1088.23 
398.05 

1629.73 
578.09 



0.02 
0.72 
0.01 
0.74 
0.00 
0.79 



8.61 
0.98 
5.62 
0.56 
5.53 
1.73 



1.00 
1.00 
1.00 
0.99 
0.99 
0.99 



8.61 
0.49 
5.75 
0.16 
4.97 
0.06 







05 
IS 
LLA 
0.5 IS 

LLA 
0.75 IS 

LLA 



1845.64 
75.94 
4351.14 

394.16 
5041.69 

337.48 



0.99 
0.99 
0.33 
0.60 
0.97 
0.90 



501.45 
93.46 
433.10 
125.51 
359.74 
124.48 



1.00 
0.73 
1.00 
0.74 
1.00 
0.67 



530.14 
76.33 

473.27 
74.17 

379.26 
46.58 



1.00 
0.98 
1.00 
0.94 
1.00 
0.91 



9575.27 
97.80 

8323.46 
106.69 

7907.54 
24.37 



0.82 
0.91 
0.98 
0.87 
1.00 
0.98 



252.36 
27.73 

171.08 
15.59 

156.65 
31.31 



1.00 
0.96 
1.00 
0.96 
0.99 
0.95 



263.41 
13.86 
181.11 

9.10 
164.34 

2.19 



A = .1 
IS 

LLA 
0.5 IS 

LLA 
0.75 IS 

LLA 



4095.33 
0.00 

4330.64 
4.56 

4536.24 
0.00 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



818.64 
0.04 
660.87 

32.36 
626.38 

81.21 



1.00 
1.00 
1.00 
0.93 
1.00 
0.87 



815.48 
15.14 

682.83 
34.80 

693.65 
87.10 



1.00 
1.00 
1.00 
0.99 
1.00 
0.99 



8626.86 
0.00 

7626.58 
0.00 

7457.80 
0.00 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



834.01 
73.78 
628.29 
115.84 
550.76 
88.95 



1.00 
0.89 
1.00 
0.85 
1.00 
0.86 



832.92 
149.55 
718.12 
121.60 
618.94 
62.41 



A = .2 
IS 

LLA 
0.5 IS 

LLA 
0.75 IS 

LLA 



3712.07 
0.00 

3768.77 
0.00 

3825.82 
0.00 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



2888.10 
0.04 
3167.21 
42.80 
2542.80 
404.72 



0.81 
1.00 
0.98 
0.99 
0.97 
0.83 



3712.07 
0.01 
3602.53 
70.75 
3076.24 
387.72 



1.00 
1.00 
1.00 
1.00 
1.00 
0.86 



4346.96 
0.00 

3781.29 
0.00 

4331.74 
0.00 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



4346.96 
0.01 

3781.29 
0.01 

4331.74 
0.01 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



4346.96 
0.01 

3781.29 
0.01 

4331.74 
0.01 



A = 1 
IS 

LLA 
0.5 IS 

LLA 
0.75 IS 

LLA 



54.18 
0.00 

40.38 
0.00 

32.85 
0.00 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



54.18 
0.01 

40.38 
0.01 

32.85 
0.01 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



54.18 
0.00 

40.38 
0.00 

32.85 
0.00 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



61.54 
0.00 

49.01 
0.00 

38.36 
0.00 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



61.54 
0.02 

49.01 
0.00 

38.36 
0.00 



1.00 
1.00 
1.00 
1.00 
1.00 
1.00 



61.54 
0.00 

49.01 
0.00 

38.36 
0.00 



4.3 Effectiveness of SQUAREM 2 

We explored the effectiveness of SQUAREM 2 , defined in Section [331 when applied to several sim- 
ulated datasets taken from the previous two simulation studies. Table [5] indicates the relative reduc- 
tion in elapsed time ('RRT') and numbers of MM updates, i.e., invocations of mapping M(-), required 
for the original and SQUAREM 2 -accelerated algorithms to converge for five randomly chosen simu- 
lation datasets across the five penalty functions. The SQUAREM 2 algorithm converged without dif- 
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Table 5: Acceleration from SQUAREM 2 applied to simplified MIST algorithm for five randomly selected simu- 
lation datasets. The reduction in elapsed time is given by 'RRT, while the number of MM updates are given for 
the original MIST implementation and SQUAREM 2 implementation in '# orig' and '# sqm 2 ', respectively. 









LAS 






ALAS 






EN 






AEN 






SCAD 






Dataset 


RRT #orig#sqm 2 


RRT 


#orig #sqm 2 


RRT 


#orig 


#sqm 2 


RRT 


#orig 


#sqm 2 


RRT 


#orig #sqm 2 




LM 
































p = 


35,o-= 1 


































62 


0.67 


260 


62 


0.81 


169 


44 


0.63 


46 


26 


0.82 


42 


23 


0.91 


485 


68 




71 


0.76 


221 


59 


0.75 


163 


41 


0.67 


49 
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ficulty in these cases and required substantially fewer MM updates than the original algorithm; the 
pe rcent reduction in time was a s hi gh as 96%. We remark here t hat the regularity conditions imposed 



in 



Roland and Varadhanl ((2005) and IVaradhan and Roland (2008), particularly smoothness conditions, 



are not satisfied in this particular class of examples. Hence, while the simulation results are certainly 
very promising, the question of convergence (and its associated rate) of SQUAREM 2 in this class of 
problems continues to remain an interesting open problem. 



5 Example: Identifying genes associated with the survival of lymphoma 
patients 

Diffuse large-B-cell lymphoma (DUBCU) is an aggressive type of non-Hodgkins l ymph oma and is one 



of the most common forms of lymphoma occurring in adults. iRosenwald et all (|2002h utilized Uym- 
phochip DNA microarrays, specialized to include genes known to be preferentially expressed within 
the germinal centers of lymphoid organs, to collect and analyze gene expression data from 240 biopsy 
samples of DUBCU tumors. For each subject, 7399 gene expression measurements were obtained. The 
expression profiles along with corresponding patient information can be downloaded from their supple- 
mental website http://llmpp.nih.gov/DUBCU/. Since the origin al profiles had so me missing expression 



measurements, we used the dataset subsequently analyzed by lUi and Guil (12004) which estimated the 
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missing values using a nearest neighbor approach. During the time of followup, 138 patient deaths were 
obs erved with median death time of 2.8 years. 

Rosenwald et al. (2002) used hierarchical clustering to group the genes into four gene-expression 



signatures: Proliferation (PS), which includes cell-cycle control and checkpoint genes, and DNA syn- 
thesis and replication genes; Major Histocompatibility Complex ClassII (MHC), which includes genes 
involved in antigen presentation; Lymph Node (LNS), which includes genes encoding for known mark- 
ers of monocytes, macrophages, and natural killer cells; an d Germinal Cen t er B ( GCB), which includes 
genes that are characteristic of germinal center B cells; see lAlizadeh et all (120001 ) for more infor mation 
on gene s ignatures. Based on the gene clusters, they built a Cox proportional hazards model (|Cox . 



19721. 1 19751) to predict survival outcomes in the DLBCL patients. Subsequently, this dataset has been 



analyzed numerous ti mes, typically to evaluate methods rel at ed to subgroup iden t ification and/or sur- 
vival prediction (e.g ., Li and Gui . 2004 ; Gui and Li . 2005bl lal; Li and Luarl 120051; Annest et al. , 2009 : 



Engler and LiL 12009: Ti bshiranil . l2009l) . 



Here, we instead focus on the performance of two different penalties, namely SCAD and MC P, with 
regard to the identification of genes associated with DLBCL survival. The simulation results of Zhang 
(120071) suggest that MCP has superior selective accuracy over the SCAD penalty, at least for the case 
of a linear model. There, selection accuracy was measured as the proportion of simulation replications 
with correct classification of both the zero and non-zero coefficients, with MCP outperforming SCAD 
in all simulation settings. To illustrate the utility and flexibility of the MIST algorithm, we reanalyzed 
the DLBCL data, fitting a penalized Cox regression model respectively using SCAD and MCP penalty 
functions, and run ning these proced ures in combination with the Iterative Sure Independence Screening 
procedure (ISIS, iFan et all l2009bl) in order to ensure that the dimension of the parameter space was 
maintained at a manageable level. For SCAD, we considered both the IS and LLA estimators. The 
existing optimization functions provided in the SIS package for the ISIS procedure were used for the 
IS estimator, whereas relevant modifications to the ISIS code were made in order to accommodate the 
fully iterative LLA and MCP estimators. Optimization at each step of the ISIS algorithm in the case of 
the MCP penalty utilized the MIST algorithm, as we are aware of no other algorithm capable of fitting 
the Cox regression model subject to MCP penalization. The default settings in the SIS package were 
used to determine the maximum number of predictors ([ 41 " ] = 10) and to define the additional ISIS 
parameters (e.g., use of the unpenalized MLE as a starting value, ranking method, tuning parameter 
selection) for all three analyses (1S-SCAD, LLA-SCAD, MIST-MCP). The parameter a was set to 3.7 
for all analyses; hence, only the selection of A required any tuning. 

Table [6] displays the 1 1 genes identified by at least one of the three analyses. The x's in a given 
column indicate the genes with non-zero coefficients resulting from the corresponding penalization. 
The final column provides references for genes previously linked to DLBCL in the literature. Genes 
belonging to the original iRosenwald et all ((2002) gene expression signatures are indicated with paren- 
thetical initials. Note that the references provided are not meant to be an exhaustive list, but instead to 
demonstrate the relevance of certain genes and/or their altered expression levels in DLBCL survival. 

Interestingly, the LLA-SCAD and MIST-MCP penalizations selected the same subset of genes, with 
a nearly a complete overlap with those selected from the 1S-SCAD penalization. The number of genes 
selected in each case is 10, the maximum specified by ISIS; 9 of these were shared across the three 
penalizations. According to NCBI Entrez Gene search (http://www.ncbi.nlm.nih.gov/), many of these 
genes are biologically relevant. For example, CDK7 codes for a protein that regulates cell cycle progres- 
sion and is represented in the Proliferation Signature, although reported under a different Lymphochip 
ID as this gene was spotted multiple times on the array. Also members of the Proliferation Signature are 
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SEPT1, coding for a protein involved in cytokinesis, and BUB3, coding for a mitotic checkpoint protein. 
DNTTIP2 regulates transcriptional activity of DNTT, a gene for a protein expressed in a restricted pop- 
ulation of normal and malignant pre-B and pre-T lymphocytes during early differentiation. HLA-DRA, 
a member of the MHC Signature, plays a central role in the immune system and is expressed in anti- 
gen presenting cells, such as B lymphocytes, dendritic cells, macrophages. From the GCB Signature, 
the ESTs weakly similar to thyroxine-binding globulin precursor is highly c ited. Additionall y, RFTN1 
plays a pivotal role in regulating B-cell antigen receptor-mediated signaling dSaekietallE003h . 

A description of AI568329 was not provided in the original dataset, thus its function is unknown. 
Similarly, although cited at least twice, a description for AA830781 was also not provided in the original 
dataset. However, both of these may be related to lymphoma or risk of death from lymphoma, as it is 
possible that these genes (and potentially others) were selected because of coexpression or correlation 
with other relevant genes. 

Interestingly the two genes n ot commonly identified across the three penalizations were both cited 



in 



Martinez-Climent et all (120031 ). They found altered gene expression of TSC22D3 and ITGAL (both 



involved in a variety of immune phenomena) in one case who initially presented with follicle center 
lymphoma and subsequently transformed to DLBCL. 



Table 6: Genes associated with DLBCL survival with SCAD (one-step=lS and LLA) and MCP penalizations, 
sorted by the gene order in the original data set. ID refers to the unique Lymphochip identification number. The 



x's in a given column indicate the genes identified by the corresponding penalization. 



ID 


Name (Symbol) 


SCAD 


MCP References 






IS 


LLA 






27774 


cyclin-dependent kinase 7 (CDK7) 


X 


X 


X 


Rosenwald et al. (2002) (PS), Ma and Huang (2007) 
Binder and Schumacher (2008, 2009) 


31242 


acidic 82 kDa protein mRNA (DNTTIP2) 


X 


X 


X 


Binder and Schumacher (2008, 2009) 


31981 


septin 1 (SEPT1) 


X 


X 


X 


Rosenwald et al. (2002) (PS), Li and Luan (2005) 
Sinisi et al. (2006). Sha et al. (2006) 
Zhang and Zhang (2007).Annest et al. (2009) 
Binder and Schumacher (2008, 2009) 


29652 


BUB3 budding uninhibited by benzimidazoles 3 (BUB3) 


X 


X 


X 


Rosenwald et al. (2002) (PS) 


27731 


major histocompatibility complex, 
class II, DR alpha (HLA-DRA) 


X 


X 


X 


Rosenwald et al. (2002) (MHC),Li and Luan (2005) 
Gui and Li (2005a.b).Solin et al. (2009) 
Binder and Schumacher (2009) 


24376 


ESTs, Weakly similar to A47224 
thyroxine-binding globulin precursor 


X 


X 


X 


Rosenwald et al. (2002) (GCB).Ando et al. (2003) 
Gui and Li (2005a,b).Li and Luan (2005) 
Annest et al. (2009). Sohn et al. (2009) 
Binder and Schumacher (2008, 2009) 


22162 


delta sleep inducing peptide, immunoreactor (TSC22D3) 




X 


X 


Martinez-Climent et al. (2003) 


23862 


(AI568329) ESTs 


X 


X 


X 




24271 


integrin, alpha L (ITGAL) 


X 






Martinez-Climent et al. (2003) 


33358 


(AA830781) 


X 


X 


X 


Li and Luan (2005) 

Binder and Schumacher (2009) 


32679 


KIAA0084 protein (RFTN1) 


X 


X 


X 


Gui and Li (2005b), Sha et al. (2006) 
Zhang and Zhang (2007).Annest et al. (2009) 
Binder and Schumacher (2008, 2009) 



The results of this analysis demonstrate equivalence in selection performance between MCP and 
LLA-SCAD for the case of Cox proportional hazards model. Increasing the maximum number of pre- 
dictors to 21 again resulted in equivalent selection performance between MCP and LLA-SCAD, with 
21 predictors ultimately selected (results not shown). The IS estimator also resulted in the selection 
of 21 predictors, but with increased dissimilarity between MCP/LLA-SCAD and IS: only 13 of the 21 



22 



genes were selected by all three methods. It should be noted that iZhangl (12007b did not use any form 
iterative variable selectio n (e.g., ISIS) i n his comparisons between SCAD and MCP for the case of the 
linear model; in addition, IZhangl (120071) fixed values for both penalty parameters in his simulations and 
also did not use a = 3.7. Thus, use of the ISIS p r ocedur e, the particular method used for selecting A, 
and the use of a = 3.7 (as suggested in Fan et all (I2009bl) ) in both the MCP and SCAD penalties may 
all play a role in the results summarized above. 



6 Discussion 

This paper proposed a versatile and general algorithm capable of dealing with a wide variety of nons- 
moothly penalized objective functions, including but not limited to all presently popular combinations 
of data fidelity and penalty functions. We established a suitable convergence theory, as well as new 
results on the convergence of general MM algorithms. We also demonstrated the remarkable effective- 
ness of the SQUAREM 2 acceleration procedure in these problems as tool for accelerating the slow but 
steady convergence of the proposed class of MM algorithms. Beyond specification of the penalty pa- 
rameters) X, virtually no effort was expended in tuning or otherwise specializing the MIST algorithm 
for solving a given problem. Thus, at the expense of greater analytical work, the convergence rate of the 
MIST algorithm can likely be improved. Through the use relaxation techniques and other methods for 
controlling the step-size behavior (e.g., line-searches) of MIST, we further expect that the local nature 
of the convergence theory presented here can be made global in nature. 

The simulation results of this paper highlight the fact that nonconvex penalties tend to endow the 
corresponding objective function with multiple local minima. The resulting sensitivity of computational 
algorithms to the choice of starting value, while k nown, has arguably been deemphasized in the current 



literature. In this regard, the one-step method of IZou and Lil (120081) provides a meritorious choice of 



starting value for fully iterative SCAD-based algorithms. In addition to being unique under mild regu- 
larity conditions, it is easily generalized to other nonconvex penalties, such as MCP. Unfortunately, the 
utility of this approach for identifying starting values is also limited to settings where N > p, for the 
justification of the IS estimator relies heavily on the use of the unpenalized MLE as its starting value. 

The simulated examples in this paper only consider settings with N > p, mainly to ensure that 
m(fi) is strictly convex. Specifying e > in the ridge-like penalty term ensures that m(fi) is strictly 
convex provided only that g(fi) + h(J3, a) is convex, as might be encountered in cases where N < p. 
Thus, for example, one might consider combining the ridge term with any penalty satisfying condition 
(PI) (e.g., SCAD), providing alternatives to the elastic net penalty; our results on the convergence of 
the proposed algorithms to some stationary point of the objective function would continue to apply in 
this setting. It would be interesting to investigate the statistical properties of estimators derived under 
such combinations in settings where p > N but po <<c N, with po denoting the number of "important" 
predictors. 
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A Appendix 



This appendix is divided into several se ction s . Sec tion fA. 11 reviews and extends the converg ence the- 
ory for the EM algorithm established in lWul (11983b : the extension utilizes results of iMeyer dl976h to 
establish stronger convergence results for general MM algorithms. Section IA.2I contains the proof of 
Theorem 12.11 and makes direct use of these new convergence results. Finally, Sections IA.3I and IA.4I 
respectively contain the proofs of Theorems 13.21 and 13.41 establishing the convergence of iterated soft 
thresholding when used to minimize (TTQb and convergence of the proposed class of MIST algorithms in 
the case of the generalized linear model. 



A.l Local convergence of MM algorithms in nonsmooth problems 



Using con vergence theory for algorithms derived from point-to-set maps developed by lZangwilll (119691) . 
Iwul dl983h established the convergence of the EM algorithm assuming twice diffe rentiability of th e 



loglikelihood function. In what follows, we first restate t he key convergence result of lZangwilll ([1969); 
this result, given in Theorem lA.il and adapted from Iwul ( 1983 ). is stated in a form convenient for use 
with the MM algorithm and provides for a very gener al (and compar atively weak) form of convergence. 
We then draw on stronger convergence results due to iMeyen (|1976h in order to establish a more useful 
convergence theory for MM algorithms designed to minimize nondifferentiable objective functions; this 
result is stated in Theorem IA.3I Finally, we provide a set of sufficient regularity conditions that ensure 
the validity of the conditions of both theorems in a wide class of statistical estimation problems. 

Let g(J3) be the real- valued function to be minimized, where fi € S and S is some convex subset of 
W. Let M : S — > S be the minimization map O, where f UR (-,-) is any function that majorizes 
for {} e S. In general, M(-) is a point-to-set map, and therefore a set. We say that fi is a generalized 
fixed point of M(-) if / ? 6 M(0); we say that jS is a fixed point of M(- ) if M(fi) = {/?} (i.e., a singleton). 

The main result of lZangwilll (I1969L Theorem A), also utilized in lWul (119831) . is stated below. 



Theorem A.l. Suppose t;(fi) is a continuous, real-valued function of/3 e S that is uniformly bounded 
below. Let S c S denote the (nonempty) set of stationary points of ^(fi) for /? € & and assume the 
sequence [fir', k > 0} is generated as follows: 

• fi (0) € S, where J3 m and %(fi (0) ) are bounded; 

• k+l) € M(fi {k) ), where M(-) is the point-to-set map ©■ 
Suppose that 

Zl. Each k) € So, where the compact set So c S; 
Z2. M(-) is closed and non-empty for e S c n Sq. 
Z3. We have: 

(i) £(/?) < <;{a)for each a € S and any € M(a); 

(ii) £(/?) < <;{a)for each a £ S and any € M(a). 

Then, the following conclusions hold: 

Ml. The sequence {0 k \ k > 0} has at least one limit point in S, and the set of all limit points, say So, 
satisfies So £ <S; 



28 



M2. Each limit point fi e So satisfies lim^oo £(/8^) - 

M3. Each limit point f3 6 So is a generalized fixed point ofM(-). 



Remark A.2. Assumptions [Z1]-[Z3] are imposed in \Wu\ hl983\) . The assumption [Zl] implies that 
k > 0} is a bounded sequence, ensuring the existence of at least one limit point. Further comments 
on [Z2] will be made below, as it is possible to impose reasonable sufficient conditions that ensure this 
condition. The assumption [Z3 ] enforces the descent property at each update, as would be expected in 
any EM, GEM or MM algorithm. An equivalent formulation of this condition follows ( e.g. Meyer 197 'd 
p. 114): 

Z3' . For each a € So and fi e M{a) : 

(i) %(fi) < £(a) if a M(a) (i.e., a strict decrease occurs at points a that are not generalized 
fixed points); 

(ii) %(fi) < t;(a) if a € M(a) {i.e., if a is a generalized fixed point, it is possible to observe no 
change in the objective function). 

The above theorem essentially guarantees convergence of subsequences, but not global convergence 
of the iteration sequen ce itself. Subsequ ential convergence permits, for example, oscillatory behavior 



in the limit sequence. Meyerl (| 1 976L 1 1 91m offers several refinements of Theorem lA.il strengthening the 



statements of convergence. His results, adapted for the MM algo rithm, follow below; in particular, see 



Theorem 3.1, Corollary 3.2, and Theorems 3.5 and 3.6 of lMeyerl (119760 . 



Theorem A.3. Let the conditions of Theorem \A. 1 1 hold. Consider the following two additional condi- 
tions: 

Z4. For each a e So and any f3 e M(a), we have %(fi) < %(a) whenever M(a) + {a} (i.e., a strict 
decrease in the objective function occurs at any point a that is not a fixed point); 

Z5. there exists an isolated limit point f3* such that M(fi*) = {/?*} (i.e., a true fixed point). 

Suppose [Z1]-[Z4] hold. Then, in addition to results [M1]-[M3] of Theorem IA.il the following 
conclusions hold: 

M4. Each limit point € So satisfies M(JS) = {/?}, and is therefore a fixed point ofM(-); 

M5. lim^oo 1^8® — - 0, in which case one either has (i) the set of limit points So consists of a 

single point to which /J® converges; or, (ii) the set of limit points So forms a continuum, and /J® 
fails to converge; 

M6. If the number of fixed points having any given value of^(-) is finite, then {0~ k \ k > 0} converges to 
one of these fixed points; 

M7. If the sequence {ffi k \k > 0} has an isolated fixed point fi, then ff k ' —>J3.Ifj3 is also an isolated 
local minimum of on So, then there exists an open neighborhood S e c So of fi such that 
fi (k) ^fiif/3 {0) eS e . 

Suppose instead that [Z1-Z3] and [Z5] hold. Then, in addition to results [M1]-[M3] of Theorem 
L4.il the following conclusion can be drawn: 
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M8. If the sequence {0 k \k > 0} has an isolated generalized fixed point that satisfies M(J3) = {/?}, 
then k) ft. If fi is also an isolated local minimum of on Sq, then there exists an open 
neighborhood S £ c S offi such that j8 (fc) -» p ifj3 {0) e S e . 

Remark A.4. Assumption [Z4] strengthens [Z3] by imposing the condition that the iteration scheme is 
strictly monotonia as such, all generalized fixed points of M(-) are also fixed points, a situation that 
permits stronger statements of convergence results. Assumption [Z5] imposes the somewhat weaker 
assumption that there exists at least one isolated fixed point of the iteration sequence; similarly to [M7], 
[M8] implies that the iteration converges to this point. 



Conclusions [M1]-[M7] essentially mirror those in ( Vaida, 2005, Theorems 1-3), who obtains strong 
convergence results for the EM and MM algorithms under global differentiability assumptions on the 
objective and majorization functions and the additional condition that £ SUR (J3,a) has a unique global 
minimizer in fi for each a € S, where S is a finite set of isolated stationary points. This uniqueness 
condition, encapsulated in [Z4], provides a verifiable condition for convergence of the MM algorithm 
that is often satisfied in statistical applications. 



Sufficient conditions that ensure [Z1]-[Z4], but weaker than conditions imposed in IVaidal (|2005), 
are now provided. In particular, suppose that the objective function, its surrogate and the mapping M(-) 
satisfy the following regularity conditions: 

Rl. %(fi) is locally Lipschitz continuous and coercive for/? € S; that is, L(£(z)) = {b £ S : £(b) < £(z)} 
is compact for each z e S. Consequently, achieves a finite minimum somewhere interior to 
S; assume the set of stationary points S is finite and isolated. 

R2. £08) - % SUR (J3,P) for each fi e 8. 

R3. % SUR (fi, a) > £ SUR (fi,P) for * or, j8, or e S. 

R4. g SUR (ft, a) is continuous for (or,jS) € S x S and locally Lipschitz in fi for fi near a. 

R5. M(JS) exists and is a singleton set for each fi e S. 

The above conditions do not imply that the objective function £(/?) is differentiable everywhere. 
Condition Rl does imply that £(/?) is bounded for fi interior to S and that V£(/?) exists for almost all jS. 
Conditions R2 and R3 imply that £ SUR (J3, a) strictly majorizes gift) and, in addition, 

f UR (fi,a) = m + ^(fi,a), (21) 

where ip{J3, a) := f UR {J3, a) - %(fi) satisfies i//(fi, a) > for a + and i/r(fi,fi ) = . Assu mptions R4 



& R5 imply that the map M(fi) is continuous, hence bounded on compact sets (IPolald. 1 19871 . Prop. 3.2). 
Conditions Rl, R4, and R5 further imply that (|2TI) is bounded below for (or,jS) eSxS and that i/^(yS, a) 
is uniquely minimized at a = ft for any fixed point fi. 

Suppose conditions R1-R5 hold. As commented earlier, conditions R4 and R5 imp ly that M(J3) is 



a continuous point-to-point map; hence, M(-) is closed (e.g. Luenberger and Ye , 20081 pp. 203-204), 
establishing [Z2]. Propositions IA.5I and IA.61 given below and proved under conditions R1-R5, now 
establish [Zl], [Z3] and [Z4]. 
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M(fy">) exists, is bounded 
(22) 



(23) 



Proposition A.6. Let jS (0) be bounded. Define ^ n) = %(fi {n) )for n > 0. Then, n > 0} is a bounded, 
monotone decreasing sequence. Moreover, the sequence {0- n ',n > 0} is bounded and contained in the 
compact set L(^). 

Proof of Proposition IA.51 Let a be bounded but otherwise arbitrary. The continuity of M(-), along with 
assumption R5, implies that M(a) exists, is bounded, and is unique. Using £T|) and Assumption R2, we 
have that £ SUR (M(a), a) < % SUR (a, or) = %(a) < oo. Hence, (|22]> holds upon setting a = /3 {n) . 
To establish d23), note that dZB, ([22]> and the definition of j8 (,,+1) imply 

f u *{pF* v >,fF>) = ^os (,,+1) ) + ip(j3 in+l \fi n) ) < oo. 

By d22]) and the fact that € SUR (fi <n) ,ft n) ) = %(J$ n) ) + <p(fi {n) ,P {n) ) = ^(fi {n) ), we further observe 

£08 ( " +1) ) + (A0S ( " +1) ,)8 ( ' ,) ) < %(fi {n) ). 

from which d2"31 is immediate. Under R3 and R4, this inequality is necessarily strict unless 0- n+Y) = 
M(fi (n) ) = jS (,,) , proving the result. □ 

Proof of Proposition |A. 61 Since fi (0) is bounded, Assumption Rl implies £ (0) is bounded, fi (0) e L(£ (0) ), 
and L{^) is compact. From Proposition IA.5I and Assumption R5, we further observe that yS (1 ^ = 
M(/? (0) ) is bounded and satisfies e L(g (0) ). Using Assumption Rl once more, = £(/? (1) ) is 
bounded and, by g3), satisfies gV> < ^ 0) ; thus, L(^ 1} ) c L(^ 0) ). 

We now use induction. Let 0- n) be bounded for some n > 1 and satisfy < ^^ 0) ; then, tf 1 ^ is 
necessarily bounded and n) e L(^" ) ) c L(£ (0) ). It again follows from Proposition I A. 51 and Assumption 
R5 that y8 (n+1) = M(fi (n) ) is bounded and satisfies fi (n+l) € L(£ (n) ). Hence, is bounded and satisfies 
^(«+D < < ^r(O) Consequently, we have L(£ ( " +1) ) c L(£ (n) ) c L(£ (0) ) and fi {n+l) e L(£ (0) ) and it now 
follows that < £ n \ L(£ ( " +1) ) c L(^) c L(£ (0 >), and j8 (n) e L(^) for n > 0. Since £(•) is bounded 
below, {^ (,,) , « > 0} evidently forms a bounded, monotone decreasing sequence and {0" n \ n > 0} forms a 
bounded sequence contained wholly within the compact set L(g (0 ^). □ 



A.2 Proof of Theorem |2J] 

The assumptions stated in the theorem immediately yield that £(/?) is locally Lipschitz continuous and 
coercive for each bounded A > 0, hence (i) is satisfied. 



Proposition A.5. Suppose p~ n ' is bounded for a given n > 0. Then, jS (,l+ ) 
and is unique. In addition, for n>0, 

and 

£08 (n+1) ) - %(fi (n) ) < -i/,(/3 in+1 \/3 (n) ) < o, 

where the second inequality is strict unless ft' t+ ^ = M(0 n ^) = fr n \ 
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To show (ii), we first write 

p 

q(fi,a;A)-p(fi;A) = £ [qQpjWlAj) ~ PQPMj)] 

i=i 
p 

= J] [p{\aj\; Aj) + Aj)Q/3j\ - \ aj \) - p{]fij\; Aj)] . (24) 

7=1 

This difference is obviously equal to zero whenever fi = a. For fi + o, we shall separately consider the 
case where p(r; Aj) is linear versus nonlinear. 

First, suppose that p(r; 0) - a\ + a 2 r, where a\ > and a 2 > and each may depend on 6. It then 
follows immediately that 

p(\ aj \; Aj) + p'(\aj\; Aj)(\j3j\ - \aj\) - p(]/3j\; Aj) = {a Y + a 2 \aj\) + a 2 (\Pj\ - \aj\) - (aj + a 2 \Pj\) = 0. 

Thus, the claimed equality between © and © holds in this case. 

Now, suppose that p(r; 6) is nonlinear in r. Under (PI), we claim that (0]) strictly majorizes p(J3; A) 
provided the derivative of the penalty p'(-, Aj) is strictly positive. To see this, observe that concavity 
(e.g., see ©) implies the inequality 

q(r, s; 6) - p(r; 0) = -I [p(r; 6) - p(s; 6) - p'(s; 0)(r - s)] > 0, 

with equality holding if and only if r - s and p'{s; 6) > 0. For penalties such that their derivatives 
are nonnegative, i.e., p'(s;6) > 0, we obtain the same inequality as above, with equality additionally 
holding for r and s sufficiently large. Therefore, 

p 

q(fi, a; A) - p(fi; A) = £ [qQPj\, \aj\; Aj) - p{\pj\; Aj)] > 0, 

7=1 

and (ii) is established. 

In order to establish the majorization property specified in (iii), we begin by noting that our assump- 
tions on #08), h{fi, a), and p(-\ 9) imply that % SUR (J3, a) and i/K/?, a) = h(fi, a) + q(J3, a; A) - p(fi\ A) are 
both continuous in fi and or. Our assumptions further imply that a) > 0; if at least one of h{fi, a) or 
q(J3, a; A) - p(J3; A) is strictly positive for fi + a, then tfr(fi, a) > for or ± fi and i{t(J3,fi) = 0. Therefore, 
the objective function g(J3) is strictly majorized by g SUR (J3, a) = £(/?) + tpifi, a). 

In order to establish the convergence of the corresponding MM algorithm in (iii), it suffices to prove 
that the assumptions of the theorem and consequent assertions established thus far are sufficient to ensure 
that Conditions R1-R5 of Appendix IA. H are met, in which case Theorem I A. 31 applies directly. The result 
(i), combined with the assumption that the stationary points are all isolated, immediately establishes 
Condition Rl ; as proved above, conditions R2 and R3 also hold. If ifj(fl, a) = h(fi, a)+q(J5, a; A)-p{fi; A) 
is continuous in a and fi and locally Lipschitz continuous in fi near a, then (i) implies that R4 also holds. 
By assumption, h(J3, a) is continuous in a and continuously differentiable in /?, hence locally Lipschitz 
in j8. Continuity of q{fi, a; A) - p(fi; A) in both or and fi is also immediate. Hence, R4 holds provided 
that qifi, a; A) - pifi; A) is locally Lipschitz continuous in j8 near a. To see that this is the case, we note 
that (|24l is a linear combination of functions in fij of the form p'(|a,|; Ai)\Pj\ - p(|/8/|; Aj), where | • | and 
-p(-; A) are both convex, hence locally Lipschitz. Since both the sum and composition of two locally 
Lipschitz functions are locally Lipschitz, the result now follows. Finally, R5 is ensured by R1-R4 and 
the condition in (iii) that £ s UR (J3, a) is uniquely minimized in j8 for each a. 
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A.3 Proof of Theorem 1551 

Under the stated conditions and for any bounded a, m(J}) = gifi) + h(fi, a) + /le||/i|| 2 is strictly convex 
with a Lipschitz continuous derivative of order L~ l > 0; in addition, J^ p - =i p' (\a j\; Aj)\/3 j\ is also convex 
in /?. Hence, for each bounded a there exis t s a un ique solution fi* = fi*(a) when minimizing (fTOl) . 

In the notation of Combettes and Wajsl (2005), we may identify the Hilbert space 'H with R p , /2OS) 



with m(fi) and f\(fi) with Yfs = \ P'(\<Zj\l Th e assumptions of the theorem ensure that the regularity 

conditions of Proposition 3. 1 and Theorem 3.4 of Combettes and Wajsl ( 2005 ) are met. In particular, be- 



cause m(fi) is coercive and strictly convex, Proposition 3.1 guarantees the existence of a unique solution 
to 

min fM + MP) 



as well as provides the relevant fixed point mapping; Theorem 3.4 establishes the weak convergence 
of the corresponding iterative scheme to this unique solution. Since weak convergence is equivalent 
to strong convergence in a finite dimensional Hilbert space, such results imply componentwise conver- 
gence of the resulting iteration sequence to 



Both Proposition 3.1 and Theorem 3.4 of ICombettes and W ajs (2005) r ely on the gradient o f /2O8) 



and the so-called "proximity operator" of f\(fi). Example 2.20 in ICombettes and Wajsl (120051) shows 
that the proximity operator for f\(J$) = Yj P = \ p'(\aij\', Aj)\j3j\ is exactly £(•; r). The algorithm summarized 
in the statement of the theorem is therefore observed to be a specific instance of that described in the 
Theo rem 3.4 with (in their notation) a n = b n = and A n = 1 for every n. 

Hale etal. I (I2008L Theorem 4.5) undertake a detailed study of the proposed algorithm for the special 
case of a convex, differentiable fz(fi) and f\(fi) = Zy =1 in this case, they prove that the algorithm 
converges in a finite number of iterations. 

A minor extension of their arguments may be used to establish the same result for f\(JS) = 
EJU P'(\aj\; *j)]Pj\, provided that p'Qajl; Xj) € [0, 00) for each j. 

A.4 Proof of Theorem EH 

To establish (1.), note that the choice of h(fi,a) i n ([TBI with a ppropriate vj guarantees majorization of 
-t(fi) provided V 2 (-^(j8)) can be bounded (e.g., ILangei 12004. Ch 6). Penalties of form (© satisfying 



assumption (PI) can be linearly majorized so that (fT4l majorizes g g i m (fi). For (2.), -t(fi) is indeed strictly 
convex and coercive, with h(fi, a) > continuous in both j8 and a and continuously differentiable in j8 
for each a, with h(fi, a) = when j8 = a. We provide a more detailed proof of (3.) below. Let f = 2tt7 _1 . 
Note that the surrogate £ s UR (fi, a) is differentiable in flj only if (3j ± 0. Assuming fij + for j + and 
excluding irrelevant constants, 



Setting ( 1231 ) equal to zero implies 



= - [W(£K)] y + fflj - {aj + TjSigntfj) + 2Aej3j. (25) 



b = { mr* [WHj + ("J - t j) Pi > 

For sign consistency, we must impose that j^m: ([^AoO]j + C a j) > T j when fij > and 
^ (\VmJ\j + Ca } ) < -Tj when j8; < 0. 
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When g + 2Ae ([^(^)]y + £ a j)\ - 1 '}> we set Pj - 0- In summary, 

from which the first part of ( fT5l > directly follows for j € {l,...,p}. We do not penalize the intercept, 
thus 

= -[W(a)]o + <rA>-£*o 



so that^ = (rW(a)] + £*<>)/£ 

Furthermore, take ft + k for any /? g < R P+1 and a: = {kq,k t ) t € T^ -1 " 1 is arbitrary. Then, following 

arguments similar to those in lDaubechies et al.l (120041 . Prop. 2. 1), 



p 

7=1 



- ?™(fl, or) + (| + ie)*'* + £*g + k (£3 - - [V€(or)] ) 



7=1 

Consider p=p* = [fi* ,fi* T ] T where ft* defined in d), and define sets J = {1,2, . . . ,p], J Q = {j e J : 
P* = 0} and Ji = J\J . Noting that fi*. satisfies (£ + 2Ae)f3* - {aj - [W(or)]y = -r 7 sign(/3J) for j 6 Ji, 
and noting that - £a - [W(ar)]o = 0, we have 



7=1 

= + <^y* + ^0 + 2 h 1 ^ 1 - + 

7'eJo 

For 7 e J7o> + [V€'(S)]y| < tj, so that t/|k/| - Kj(gaj + [W(or)] 7 ) > 0. For e Ji, there are two cases, 
corresponding to the sign of /3*. First consider /?*. > 0, then 

r y (^. + ^|-^|)- W ign08p = Tyfl^ + Kyl-tfJ + */))>(>. 

If/3* < 0,then 

r 7 (^. + ^|-^.|)- W ignOep = Tj(\j3*+Kj\ + (fi} + Kj))>0. 

Thus, £ s ™fj8* + k, or) - % s Jj^(fi*,a) > (| + /le)*'*: + > since > 0, hence guaranteeing a 
unique minimum, and proving the proposition. □ 
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